The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about common practices and the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a considerable portion of participants (32%) stated that they did not have enough time for method development. 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
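As an illustration of two of the strategies most frequently reported above (patch-based training for oversized samples and k-fold cross-validation on the training set), here is a minimal sketch; the patch size, fold count, and case list are illustrative assumptions, not values taken from the survey.

```python
import numpy as np
from sklearn.model_selection import KFold

def sample_patch(volume, patch_size=(64, 64, 64), rng=None):
    """Draw one random patch from an image volume that is too large to process at once."""
    rng = rng or np.random.default_rng()
    starts = [rng.integers(0, dim - p + 1) for dim, p in zip(volume.shape, patch_size)]
    return volume[tuple(slice(s, s + p) for s, p in zip(starts, patch_size))]

# 5-fold cross-validation over a hypothetical list of training cases.
case_ids = np.arange(100)
for fold, (train_idx, val_idx) in enumerate(
        KFold(n_splits=5, shuffle=True, random_state=0).split(case_ids)):
    print(f"fold {fold}: {len(train_idx)} training / {len(val_idx)} validation cases")
```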
Human modeling and relighting are two fundamental problems in computer vision and graphics, where high-quality datasets can largely facilitate related research. However, most existing human datasets only provide multi-view human images captured under the same illumination. Although valuable for modeling tasks, they are not readily usable in relighting problems. To promote research in both fields, in this paper, we present UltraStage, a new 3D human dataset that contains more than 2K high-quality human assets captured under both multi-view and multi-illumination settings. Specifically, for each example, we provide 32 surrounding views illuminated with one white light and two gradient illuminations. In addition to regular multi-view images, gradient illuminations help recover detailed surface normal and spatially-varying material maps, enabling various relighting applications. Inspired by recent advances in neural representation, we further interpret each example into a neural human asset which allows novel view synthesis under arbitrary lighting conditions. We show our neural human assets can achieve extremely high capture performance and are capable of representing fine details such as facial wrinkles and cloth folds. We also validate UltraStage in single image relighting tasks, training neural networks with virtually relit data rendered from the neural assets and demonstrating realistic rendering improvements over prior art. UltraStage will be publicly available to the community to stimulate significant future developments in various human modeling and rendering tasks.
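To make the role of the gradient illuminations concrete, below is a minimal sketch of one common photometric approach to normal recovery from a gradient-lit image and its complement; the color-to-axis encoding and the ratio formula are assumptions about a typical pipeline, not necessarily the exact processing used for UltraStage.

```python
import numpy as np

def normals_from_gradient_pair(img_grad, img_grad_inv, eps=1e-6):
    """Estimate per-pixel surface normals from an image lit by a color gradient
    pattern and one lit by its complement. Assumes the RGB channels of the
    gradient encode the x/y/z directions; inputs are HxWx3 floats in linear color.
    The ratio of the two images cancels albedo, leaving a direction estimate."""
    n = (img_grad - img_grad_inv) / (img_grad + img_grad_inv + eps)
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + eps)  # unit-length normals
```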
In this paper, we aim to design an efficient real-time object detector that exceeds the YOLO series and is easily extensible for many object recognition tasks such as instance segmentation and rotated object detection. To obtain a more efficient model architecture, we explore an architecture that has compatible capacities in the backbone and neck, constructed by a basic building block that consists of large-kernel depth-wise convolutions. We further introduce soft labels when calculating matching costs in the dynamic label assignment to improve accuracy. Together with better training techniques, the resulting object detector, named RTMDet, achieves 52.8% AP on COCO with 300+ FPS on an NVIDIA 3090 GPU, outperforming the current mainstream industrial detectors. RTMDet achieves the best parameter-accuracy trade-off with tiny/small/medium/large/extra-large model sizes for various application scenarios, and obtains new state-of-the-art performance on real-time instance segmentation and rotated object detection. We hope the experimental results can provide new insights into designing versatile real-time object detectors for many object recognition tasks. Code and models are released at https://github.com/open-mmlab/mmdetection/tree/3.x/configs/rtmdet.
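As a rough illustration of the basic building block mentioned above, here is a sketch of a residual block centered on a large-kernel depth-wise convolution; the kernel size, normalization, and activation choices are plausible assumptions, and the actual RTMDet block layout differs in detail (the released configs are the definitive reference).

```python
import torch.nn as nn

class LargeKernelDWBlock(nn.Module):
    """Sketch of a block built around a large-kernel depth-wise convolution."""
    def __init__(self, channels, kernel_size=5):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels, bias=False)
        self.pw = nn.Conv2d(channels, channels, 1, bias=False)  # point-wise mixing
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        return x + self.act(self.bn(self.pw(self.dw(x))))  # residual connection
```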
Generalist models, which are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model, have been explored recently. Though a hopeful alternative route toward general-purpose AI, existing generalist models are still at an early stage, with limited modality and task coverage. To empower multi-modal task-scaling and speed up this line of research, we release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction. At the core of OFASys is the idea of decoupling multi-modal task representations from the underlying model implementations. In OFASys, a task involving multiple modalities can be defined declaratively, even with just a single line of code. The system automatically generates task plans from such instructions for training and inference. It also facilitates multi-task training for diverse multi-modal workloads. As a starting point, we provide presets of 7 different modalities and 23 highly diverse example tasks in OFASys, with which we also develop a first-of-its-kind single model, OFA+, that can handle text, image, speech, video, and motion data. The single OFA+ model achieves 95% of the performance of 15 task-finetuned models on average, with only 16% of their parameters, showcasing the performance reliability of multi-modal task-scaling provided by OFASys. OFASys is available at https://github.com/OFA-Sys/OFASys
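To give a feel for what such a declarative multi-modal instruction could look like, here is a toy sketch that parses a one-line task declaration into typed input and output slots; both the slot syntax and the parser are illustrative assumptions for this write-up, not the actual OFASys grammar or API.

```python
import re

# Hypothetical instruction format: "[MODALITY:name]" slots, "->" separates in/out.
INSTRUCTION = "[IMAGE:img] what does the image describe? -> [TEXT:cap]"

def parse_slots(instruction):
    """Split a declarative instruction into input/output sides and typed slots."""
    inputs, outputs = instruction.split("->")
    slot = re.compile(r"\[([A-Z]+):(\w+)\]")
    return {"input_slots": slot.findall(inputs), "output_slots": slot.findall(outputs)}

print(parse_slots(INSTRUCTION))
# {'input_slots': [('IMAGE', 'img')], 'output_slots': [('TEXT', 'cap')]}
```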
The power of Deep Neural Networks (DNNs) depends heavily on the training data quantity, quality and diversity. However, in many real scenarios, it is costly and time-consuming to collect and annotate large-scale data. This has severely hindered the application of DNNs. To address this challenge, we explore a new task of dataset expansion, which seeks to automatically create new labeled samples to expand a small dataset. To this end, we present a Guided Imagination Framework (GIF) that leverages the recently developed big generative models (e.g., DALL-E2) and reconstruction models (e.g., MAE) to "imagine" and create informative new data from seed data to expand small datasets. Specifically, GIF conducts imagination by optimizing the latent features of seed data in a semantically meaningful space, which are fed into the generative models to generate photo-realistic images with new contents. For guiding the imagination towards creating samples useful for model training, we exploit the zero-shot recognition ability of CLIP and introduce three criteria to encourage informative sample generation, i.e., prediction consistency, entropy maximization and diversity promotion. With these essential criteria as guidance, GIF works well for expanding datasets in different domains, leading to 29.9% accuracy gain on average over six natural image datasets, and 12.3% accuracy gain on average over three medical image datasets. The source code will be released at https://github.com/Vanint/DatasetExpansion.
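For intuition, the three guidance criteria can be written as simple loss terms over CLIP zero-shot class probabilities. The sketch below is one plausible formulation and weighting, assumed for illustration rather than taken from the GIF objective itself.

```python
import torch
import torch.nn.functional as F

def guidance_loss(probs_new, probs_seed, w_cons=1.0, w_ent=1.0, w_div=1.0):
    """Illustrative guidance over zero-shot class probabilities (batch x classes):
    - prediction consistency: generated samples keep the seed's predicted class,
    - entropy maximization: samples stay informative rather than trivially confident,
    - diversity promotion: samples in a batch spread out instead of collapsing."""
    log_new = probs_new.clamp_min(1e-8).log()
    consistency = F.kl_div(log_new, probs_seed, reduction="batchmean")
    entropy = -(probs_new * log_new).sum(dim=-1).mean()
    diversity = torch.cdist(probs_new, probs_new).mean()  # mean pairwise spread
    return w_cons * consistency - w_ent * entropy - w_div * diversity
```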
Overlapped speech detection (OSD) is critical for speech applications in multi-party conversation scenarios. Despite numerous research efforts and progress, OSD remains an open challenge compared with voice activity detection (VAD), and its overall performance is far from satisfactory. Most prior studies formulate OSD as a standard classification problem, recognizing speech with binary (OSD) or three-class labels (joint VAD and OSD). In contrast to the mainstream, this study investigates the joint VAD and OSD task from a new perspective. In particular, we propose extending a conventional classification network with a multi-exit architecture. Such an architecture endows our system with the unique capability of recognizing classes using either low-level features from early exits or high-level features from the last exit. In addition, two training schemes, knowledge distillation and dense connections, are adopted to further improve system performance. Experimental results on benchmark datasets (AMI and DIHARD-III) validate the effectiveness and generality of our proposed system. Our ablations further reveal the complementary contributions of the proposed schemes. With an $F_1$ score of 0.792 on AMI and 0.625 on DIHARD-III, our proposed system not only outperforms several top-performing models on these datasets but also surpasses the current state-of-the-art models on both. Besides the performance gains, our proposed system also offers the appealing potential of a quality-complexity trade-off, which is highly desirable for efficient OSD deployment.
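A minimal sketch of the multi-exit idea described above is shown below: each backbone stage gets its own classification head, so the three-class (non-speech / single-speaker / overlapped) decision can be read from an early exit when complexity matters. The recurrent backbone and feature dimensions are illustrative assumptions, not the architecture reported in the paper.

```python
import torch.nn as nn

class MultiExitClassifier(nn.Module):
    """Toy multi-exit frame classifier: one classification head per backbone stage."""
    def __init__(self, in_dim=80, hidden=256, num_classes=3, num_stages=3):
        super().__init__()
        self.stages = nn.ModuleList(
            [nn.GRU(in_dim if i == 0 else hidden, hidden, batch_first=True)
             for i in range(num_stages)])
        self.exits = nn.ModuleList(
            [nn.Linear(hidden, num_classes) for _ in range(num_stages)])

    def forward(self, feats):                 # feats: (batch, frames, in_dim)
        outputs, x = [], feats
        for stage, exit_head in zip(self.stages, self.exits):
            x, _ = stage(x)
            outputs.append(exit_head(x))      # frame-level logits from this exit
        return outputs                        # train all exits; pick one at inference
```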
To address the monaural speech enhancement problem, a large body of research has been conducted to enhance speech by operating either on learned representations of the speech mixture in the time domain, or on the fixed full-band short-time Fourier transform (STFT) spectrogram in the frequency domain. Recently, several studies on sub-band based speech enhancement have been proposed. By enhancing speech through operations on sub-band spectrograms, these studies demonstrated competitive performance on the DNS2020 benchmark dataset. Although appealing, this new research direction has not been fully explored and there is still room for improvement. Therefore, in this study, we delve into this latest research direction and propose a sub-band based speech enhancement system with perceptually-motivated optimization and dual transformations, called PT-FSE. Specifically, our proposed PT-FSE model improves its backbone, a full-band and sub-band fusion model, through three efforts. First, we design a frequency transformation module that aims to strengthen global frequency correlations. A temporal transformation is then introduced to capture long-range temporal context. Finally, a novel loss, leveraging properties of human auditory perception, is proposed to encourage the model to focus on low-frequency enhancement. To validate the effectiveness of the proposed model, extensive experiments are conducted on the DNS2020 dataset. Experimental results show that our PT-FSE system achieves substantial improvements over its backbone and also outperforms the current state of the art, while being 27% smaller than the SOTA model. With an average NB-PESQ of 3.57 on the benchmark dataset, our system offers the best speech enhancement results reported to date.
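As an illustration of the frequency transformation idea named above, the sketch below mixes information across the whole frequency axis of a spectrogram-like feature map at each time frame; a single residual linear map is used purely for illustration, and the actual module in PT-FSE is more elaborate.

```python
import torch.nn as nn

class FrequencyTransform(nn.Module):
    """Toy frequency transformation: model global correlations along the frequency axis."""
    def __init__(self, num_freq_bins):
        super().__init__()
        self.mix = nn.Linear(num_freq_bins, num_freq_bins)  # acts on the frequency dimension
        self.norm = nn.LayerNorm(num_freq_bins)

    def forward(self, spec):                      # spec: (batch, time, freq)
        return self.norm(spec + self.mix(spec))   # residual mixing across frequency
```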
Meta reinforcement learning (meta-RL) is a promising approach that enables agents to learn new tasks quickly. However, most meta-RL algorithms show poor generalization in multi-task scenarios due to the insufficient task information provided by rewards alone. Language-conditioned meta-RL improves generalization by matching language instructions with the agent's behaviors. Meanwhile, learning from symmetry is an important form of human learning, so incorporating both symmetry and language instructions into meta-RL can help improve the algorithm's generalization and learning efficiency. We therefore propose a dual-MDP meta-reinforcement learning method that can efficiently learn new tasks from symmetric data and language instructions. We evaluate our method on multiple challenging manipulation tasks, and the experimental results show that it can greatly improve the generalization and efficiency of meta-reinforcement learning.
E-commerce has gone a long way in empowering merchants through the Internet. In order to stock goods efficiently and arrange marketing resources properly, it is important for them to make accurate gross merchandise value (GMV) predictions. However, making such predictions accurately is non-trivial given the lack of digitized data. In this paper, we present a solution to better forecast GMV inside the Alipay app. Thanks to graph neural networks (GNNs), which are well suited to correlating different entities to enrich information, we propose GAIA, a graph neural network (GNN) model with temporal shift-aware attention. GAIA leverages the sales information of correlated online sellers and learns neighbor correlations based on temporal dependencies. Tested on a real-world dataset from Alipay and compared with other baselines, GAIA shows the best performance. GAIA has also been deployed in a simulated online environment, where it likewise achieves large improvements over the baseline.
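Since the abstract does not spell out the attention mechanism, the sketch below only illustrates the general idea of shift-aware aggregation: a neighboring seller's history is considered at several time shifts (leads and lags), and attention weights pick the most informative alignment. The formulation, shapes, and shift range are assumptions, not GAIA's actual definition.

```python
import torch
import torch.nn.functional as F

def temporal_shift_attention(target_seq, neighbor_seqs, max_shift=3):
    """Aggregate neighbor sales histories with attention over time shifts.
    target_seq: (T, d); neighbor_seqs: (N, T, d). Circular shifting via torch.roll
    is a simplification used here only for illustration."""
    shifted = torch.stack([torch.roll(neighbor_seqs, s, dims=1)
                           for s in range(-max_shift, max_shift + 1)], dim=1)  # (N, S, T, d)
    scores = (shifted * target_seq).sum(dim=(-1, -2))        # similarity per shift: (N, S)
    weights = F.softmax(scores, dim=-1).unsqueeze(-1).unsqueeze(-1)
    return (weights * shifted).sum(dim=1)                    # shift-aware messages: (N, T, d)
```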
Facial image synthesis by manipulating latent codes in generative adversarial networks (GANs) has mainly focused on continuous attribute synthesis (e.g., age, pose, and emotion), while discrete attribute synthesis (e.g., face masks and eyeglasses) has received less attention. Directly applying existing works to facial discrete attributes may lead to inaccurate results. In this work, we propose an innovative framework to tackle challenging facial discrete attribute synthesis via semantic decomposition, referred to as SD-GAN. To be concrete, we explicitly decompose the discrete attribute representation into two components, i.e., a semantic prior basis and an offset latent representation. The semantic prior basis provides the initial direction for manipulating the facial representation in the latent space. The offset latent representation, obtained by a 3D-aware semantic fusion network, is proposed to adjust the prior basis. In addition, the fusion network integrates 3D embeddings for better identity preservation and discrete attribute synthesis. The combination of the prior basis and the offset latent representation enables our method to synthesize photo-realistic facial images with discrete attributes. Notably, we construct a large and valuable dataset, MEGN (face mask and eyeglasses images crawled from Google and Naver), to remedy the lack of discrete attributes in existing datasets. Extensive qualitative and quantitative experiments demonstrate the state-of-the-art performance of our method. Our code is available at: https://github.com/montaellis/sd-gan.
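To make the decomposition concrete, the sketch below edits a latent code as the sum of the original code, a learned semantic prior direction, and an offset predicted from the code together with a 3D embedding; the shapes and the tiny offset network here are illustrative assumptions rather than SD-GAN's actual fusion network.

```python
import torch
import torch.nn as nn

class SemanticDecompositionEdit(nn.Module):
    """Toy edit: latent code + semantic prior direction + 3D-aware offset."""
    def __init__(self, latent_dim=512, embed_3d_dim=64):
        super().__init__()
        self.prior_direction = nn.Parameter(torch.randn(latent_dim))  # semantic prior basis
        self.offset_net = nn.Sequential(                              # simplified fusion network
            nn.Linear(latent_dim + embed_3d_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim))

    def forward(self, w, embed_3d, strength=1.0):
        offset = self.offset_net(torch.cat([w, embed_3d], dim=-1))    # offset latent representation
        return w + strength * self.prior_direction + offset           # edited latent code
```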